ABSTRACT

We discuss methods that can be used to estimate the spatial correlation length $r_0$ of
galaxy samples from the observed number of pairs with similar redshifts. The standard
method is unnecessarily noisy and can be compromised by errors in the assumed selection
function. We present three alternatives, one less noisy, one that responds differently
to systematic errors, the third insensitive to the selection function, and quantify their
performance by applying them to a cosmological N-body simulation and to the Lyman-break
survey of galaxies at redshift $z \sim 3$. Researchers adopting the standard method
could easily conclude that the Lyman-break galaxy comoving correlation length was
$r_0 \sim 11 h^{-1}$ Mpc, several times larger than the correct value. The use of our proposed
methods would make this error impossible, except in the small sample limit. When
$N_{\rm gal} \lesssim 20$, major errors in estimates of $r_0$ occur alarmingly often.

Subject headings: galaxies: high-redshift; large-scale structure of universe; methods:
statistical



1. INTRODUCTION 



This paper was inspired by the work of Daddi et al. (2002, 2004), Blain et al. (2004), and
others who have estimated the spatial clustering strength of a galaxy population from the observed
positions of a small number of its members. Unable to fit a correlation function to the binned
numbers of pair counts at different spatial separations, these authors counted the number
$n_{\rm obs}$ of galaxy pairs with redshift separation $|z_1 - z_2| < \Delta z$ and compared it to the
expected number $n_{\rm exp}$ for an assumed correlation function $\xi(r)$, which Blain et al. (2004)
calculated to be

$$ n_{\rm exp} = \frac{N^2}{2} \int dz_1 P(z_1) \int dz_2 P(z_2) \int d\mathbf{\Theta}_1 \int d\mathbf{\Theta}_2 \, [1 + \xi(r_{12})] \tag{1} $$

where $N$ is the number of galaxies with measured redshifts, $P(z)$ is the survey selection function
(i.e., the redshift distribution that would be observed for an infinitely large sample in the
absence of clustering), $\Omega$ is the solid angle of the survey, $r_{12}$ is the comoving distance
between the points specified by $(\mathbf{\Theta}_1, z_1)$ and $(\mathbf{\Theta}_2, z_2)$, and
$\mathbf{\Theta}$ is the angular position of a galaxy within $\Omega$.^2 They then restricted
their attention to a family of correlation functions $\xi(r) = (r/r_0)^{-1.6}$ that could be specified
by a single parameter, $r_0$, and estimated $r_0$ for their galaxy population by finding the value that
made $n_{\rm exp} = n_{\rm obs}$. Inspired by Poisson statistics, Blain et al. (2004) took as a $1\sigma$
confidence interval the set of $r_0$ that satisfied

$$ n_{\rm obs} - n_{\rm obs}^{1/2} < n_{\rm exp}(r_0) < n_{\rm obs} + n_{\rm obs}^{1/2}. \tag{2} $$
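
The procedure behind equation 2 amounts to three root-finding problems: solve $n_{\rm exp}(r_0) = n_{\rm obs}$ and $n_{\rm exp}(r_0) = n_{\rm obs} \pm n_{\rm obs}^{1/2}$. The following is a minimal sketch under loud assumptions: `n_exp_toy` is a hypothetical stand-in for the integral in equation 1 (its `n_random` and `scale` values are invented for illustration, not taken from any survey), and the only property it shares with the real calculation is that the expected pair count rises monotonically with $r_0$.

```python
import math

def n_exp_toy(r0, n_random=2000.0, scale=30.0, gamma=1.6):
    """Hypothetical stand-in for n_exp(r0): random pairs times (1 + xi_bar).

    xi_bar is taken proportional to r0**gamma, as for a power-law xi(r);
    n_random and scale are made-up numbers, not survey values.
    """
    xi_bar = (r0 / scale) ** gamma
    return n_random * (1.0 + xi_bar)

def solve_r0(n_target, n_exp=n_exp_toy, lo=0.0, hi=100.0, tol=1e-6):
    """Bisect for the r0 with n_exp(r0) = n_target (n_exp must rise with r0)."""
    if n_target <= n_exp(lo):
        return lo  # no more pairs than an unclustered sample: r0 pinned at 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n_exp(mid) < n_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def poisson_interval(n_obs, n_exp=n_exp_toy):
    """Return (lower, best, upper) values of r0 in the sense of equation 2."""
    half_width = math.sqrt(n_obs)
    return (solve_r0(n_obs - half_width, n_exp),
            solve_r0(n_obs, n_exp),
            solve_r0(n_obs + half_width, n_exp))

lower, best, upper = poisson_interval(2539)
print(f"r0 = {best:.2f} (+{upper - best:.2f}, -{best - lower:.2f}) in toy units")
```

Replacing `n_exp_toy` with a real evaluation of equation 1 leaves the root-finding unchanged; the interval it returns inherits every shortcoming of the Poisson assumption.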

The approach can provide useful constraints on $r_0$ when other methods fail, but the
implementation described above is imperfect. Equation 1 is unnecessarily noisy and is more sensitive
to the assumed selection function than to the clustering strength $\xi$; equation 2 almost always
underestimates the true uncertainty in $r_0$. The goal of this paper is to draw attention to these
shortcomings and to suggest modifications that make the analysis less subject to them. Section 3.1
discusses the effect of uncertainties in the selection function, showing that a 20% error in the
assumed width of a Gaussian selection function can easily change the inferred value of $r_0$ by a
factor of 2 or more. Sections 3.2 and 3.3 point out two additional sources of noise in equation 1
that are easily removed. My suggested revisions to the method are put forward in § 4 and tested with
a cosmological N-body simulation in § 5. Section 6 considers the uncertainty in the best-fit values
of $r_0$, showing that equation 2 is a poor approximation and suggesting a modification that leads to
more realistic error bars. The main conclusions are summarized and discussed in § 7. To motivate the
discussion, I begin in § 2 with an example that shows the standard analysis of redshift pair-counts
going badly awry.



2. A FAULTY ANALYSIS OF LYMAN-BREAK GALAXIES 

The analyzed sample consists of the 747 Lyman-break galaxies with apparent magnitude
$23.5 < \mathcal{R} < 25.5$ in the fields 3c324, b20902, CDFa, CDFb, DSF2237a, DSF2237b, HDF, Q0201,
Q0256, Q0302, Q0933, Q1422, SSA22a, SSA22b, and Westphal whose spectroscopic redshifts were
published by Steidel et al. (2003). The size of the observed fields varied but was typically
$9' \times 9'$. I calculated the observed number of pairs with comoving radial separation
$Z < 20 h^{-1}$ Mpc in each field individually. Summing over all fields, a total of
$n_{\rm obs} = 2539$ pairs were found with comoving radial separations in this range. Since the
Lyman-break technique selects galaxies over a broad range of redshifts $2.3 \lesssim z \lesssim 3.7$,
I approximated the selection function $P(z)$ as a Gaussian with mean redshift $\mu = 3.0$ and
standard deviation $\sigma_{\rm sel} = 0.4$. To calculate the expected number of pairs with
$Z < 20 h^{-1}$ Mpc in the $i$th field for a given value of $r_0$, I inserted this selection function
into equation 1, assumed a correlation function slope of $\gamma = 1.6$, and integrated numerically
over the field's solid angle $\Omega_i$. I set the expected total number of pairs $n_{\rm exp}(r_0)$
equal to the sum of the expected number for each individual field. A value of
$r_0 = 11.08 h^{-1}$ Mpc was required for $n_{\rm exp}$ to equal $n_{\rm obs}$, while
$r_0 = 10.82 h^{-1}$ Mpc made $n_{\rm exp} = n_{\rm obs} - n_{\rm obs}^{1/2}$ and
$r_0 = 11.33 h^{-1}$ Mpc made $n_{\rm exp} = n_{\rm obs} + n_{\rm obs}^{1/2}$. I conclude that the
correlation length for Lyman-break galaxies is $r_0 = 11.1 \pm 0.25 h^{-1}$ Mpc at the $1\sigma$ level.

^2 The variable $\mathbf{\Theta}$ is written in boldface because two numbers are required to specify
the angular position of an object on the sky. If $\alpha$ represents right ascension and $\delta$
represents declination, $d\mathbf{\Theta}$ can be interpreted as $\cos(\delta)\, d\alpha\, d\delta$.

As noted in the abstract, this estimate of $r_0$ is roughly $20\sigma$ away from the value
$r_0 \simeq 4.0 \pm 0.6 h^{-1}$ Mpc measured by Adelberger et al. (2004). What went wrong?



3. SOURCES OF ERROR 

3.1. Selection-function Uncertainties

Most of the error in the previous section's estimate of $r_0$ came from the inaccurate model of
the redshift selection function. Although it is not always acknowledged in analyses of this sort,
assumptions about the selection function have a critical effect on the results. Figure 1 shows that
in the example of § 2 the best-fit value of $r_0$ changes by more than an order of magnitude as the
assumed width of the Gaussian selection function increases from $\sigma_{\rm sel} = 0.2$ to
$\sigma_{\rm sel} = 0.5$. If we had adopted the correct width $\sigma_{\rm sel} = 0.3$ (Adelberger et
al. 2004) instead of $\sigma_{\rm sel} = 0.4$, we would have found $r_0 = 7.2$ instead of
$r_0 = 11.1 h^{-1}$ Mpc, significantly closer to the true value $r_0 \sim 4 h^{-1}$ Mpc.

Unfortunately analyses similar to the one in § 2 are usually attempted when the sample size is
extremely small, too small for $\sigma_{\rm sel}$ to be determined empirically. In this case it is
difficult to know which to adopt among the possible values of $r_0$ suggested by plots similar to
Figure 1. Although theoretical arguments may provide a reasonable estimate of the selection-function
shape, it seems sensible to reduce as far as possible the dependence of the answer on the assumed
shape.

Approximating the selection function as a boxcar with half-width $L$, equation 1 can be rewritten

$$ n_{\rm exp} = \frac{C (1 + \bar\xi)}{L} \tag{3} $$

where $C$ is an uninteresting constant and $\bar\xi$ is the spatially-averaged correlation function
defined by equation 1. This form makes it easy to see why the implied value of $r_0$ can be so
strongly affected by the assumed selection function. If the field size $\Omega$ or redshift
separation $\Delta z$ is large compared to $r_0$, as is usually the case, $\bar\xi$ will be
significantly less than unity. The change in $n_{\rm exp}$ that accompanies a significant change in
the correlation strength, $|dn_{\rm exp}/d\ln\bar\xi| = C\bar\xi/L$, will therefore be considerably
smaller than the change in $n_{\rm exp}$ that accompanies significant changes in $L$,
$|dn_{\rm exp}/d\ln L| = C(1 + \bar\xi)/L$, and minor errors in the assumed selection function will
lead to major errors in the inferred value of $r_0$. Although these results were derived for a boxcar
selection function, similar results hold for other types.
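
The leverage that the assumed width has over $r_0$ can be illustrated with a few lines of arithmetic. This is a sketch, not part of the original analysis: it assumes the boxcar form $n_{\rm exp} = C(1+\bar\xi)/L$ implied by the two derivatives above, uses the fact that $\bar\xi \propto r_0^{\gamma}$ for a power-law $\xi(r)$, and adopts a purely hypothetical true value $\bar\xi = 0.1$. The function names are invented for the example.

```python
def n_exp_boxcar(xi_bar, half_width, c=1.0):
    """Expected pair count for a boxcar selection function: C (1 + xi_bar) / L."""
    return c * (1.0 + xi_bar) / half_width

def inferred_xi_bar(xi_bar_true, width_ratio):
    """xi_bar needed to reproduce the observed pair count when the assumed
    half-width is width_ratio times the true one (boxcar form solved for xi_bar)."""
    return width_ratio * (1.0 + xi_bar_true) - 1.0

def r0_bias_factor(xi_bar_true, width_ratio, gamma=1.6):
    """Ratio of inferred to true r0, using xi_bar proportional to r0**gamma."""
    ratio = inferred_xi_bar(xi_bar_true, width_ratio) / xi_bar_true
    return max(ratio, 0.0) ** (1.0 / gamma)

# Hypothetical case: true xi_bar = 0.1, selection function assumed 20% too wide.
print(inferred_xi_bar(0.1, 1.2))   # roughly 0.32, i.e. xi_bar inflated about 3x
print(r0_bias_factor(0.1, 1.2))    # roughly 2: r0 overestimated by about a factor of 2
```

With these illustrative numbers a 20% overestimate of the width roughly doubles the inferred $r_0$, and a 20% underestimate drives the inferred $\bar\xi$ negative, which the sketch clips to $r_0 = 0$.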

One way to reduce the method's sensitivity to the selection function is to design the experiment
to maximize $\bar\xi$. Since $\bar\xi$ increases as $\Omega$ decreases, experiments with smaller
fields-of-view are less affected by uncertainties in the selection function. In practice, however,
the field-of-view is set by the instrument that is used and observers are unlikely to want to
discard much of their data. Decreasing $\Delta z$ is a more palatable option, but, owing to peculiar
velocities and to uncertainties in galaxies' measured redshifts, it cannot be decreased arbitrarily
far before genuine pairs begin to be missed. A separation of $10 h^{-1}$ comoving Mpc is a rough
lower limit for most surveys. Unfortunately this limit is large enough to ensure
$\bar\xi \lesssim 1$ for likely fields-of-view.

Another way is to use the statistic $K$ (Adelberger et al. 2004) instead of $n_{\rm obs}$ in the
analysis. This adds noise but removes the sensitivity to the selection function almost completely.
The approach is described in more detail below (§§ 4 and 5).

3.2. Angular distribution of sources 

An additional shortcoming of equation 1 is its assumption that the sources with measured
redshifts have unknown angular positions that are distributed uniformly across the observed region
$\Omega$ (see § A.3). In fact the angular positions are known (how else were redshifts measured?)
and are probably not uniformly distributed. Consider, for example, a situation where we obtained
images across a region with radius $r = 20'$, but were able to measure redshifts for only 2
galaxies. If these galaxies happened to have an angular separation of $4''$, they would be likely
to lie at nearly the same redshift even if $r_0$ were small, while if they had a separation of $40'$
they would be unlikely to lie at the same redshift even if $r_0$ were large (Figure 2). Since the
expected number of close redshift pairs for a known correlation length $r_0$ depends on the galaxies'
angular separations, our attempts to infer $r_0$ from the number of pairs will be improved if we take
the galaxies' actual separations into account. Neglecting this information adds noise to the
analysis and can bias the results if the spectroscopically observed galaxies were not chosen at
random.

3.3. Redshift distribution of sources 

Figure 3 illustrates another source of noise. Suppose we have found a single galaxy at redshift
$z_2$. How many other galaxies should we expect to find in the redshift interval
$z_2 - \Delta z < z < z_2 + \Delta z$ for an assumed value of $r_0$? The answer depends on the
distance between $z_2$ and the peak of the selection function. If $z_2$ lies near the peak, we would
expect a large number of pairs even if $r_0$ were small; if $z_2$ lies in the wings we would expect
few pairs even if $r_0$ were large. Since the galaxies in pencil beam surveys tend to lie in a small
number of prominent spikes in the redshift histogram, the expected number of redshift pairs is
strongly affected by the alignment or misalignment of the spikes with the peak of the selection
function. Equation 1 is noisier than it needs to be because it ignores the locations of redshift
spikes when calculating $n_{\rm exp}$.



4. ALTERNATIVES

This section suggests alternate approaches that are less affected by the shortcomings discussed
above. The first two are refinements in the calculation of $n_{\rm exp}$; the last relies on a
slightly different statistic. The notation we use is explained more fully in the appendix.

As shown in the appendix, equation 1 (in its correctly normalized form, equation A16) gives the
number of redshift pairs one should expect to observe given only the information that $N$ galaxies
lie somewhere in the field of view $\Omega$. But what if we know the angular positions of the
sources? How does this change $n_{\rm exp}$? If $P(|Z_{ij}| < \ell \,|\, \theta_{ij})$ is the
probability that a galaxy pair with angular separation $\theta_{ij}$ has comoving radial separation
$|Z_{ij}| < \ell$, then the expected total number of redshift-pairs should be equal to the sum over
all pairs of $P(|Z_{ij}| < \ell \,|\, \theta_{ij})$:

$$ n_{\rm exp} = \sum_{i>j}^{\rm pairs} P(|Z_{ij}| < \ell \,|\, \theta_{ij}). \tag{4} $$

This equation can be evaluated with the help of equation A11. Using it in place of equation 1 will
remove the noise and bias that arise from the angular positions of the sources. The estimated
correlation length of Lyman-break galaxies in our example analysis (§ 2), reduced from
$11.1 h^{-1}$ Mpc to $7.2 h^{-1}$ Mpc by the adoption of the correct selection function, is further
reduced to $6.4 h^{-1}$ Mpc when equation 4 is used instead of equation 1. The reduction from
$7.2 h^{-1}$ to $6.4 h^{-1}$ Mpc results partly from the fact that the angular positions of galaxies
with measured redshifts were clumped together into slitmask-sized regions, not distributed randomly
across the field.

How can we incorporate knowledge of the spike redshifts into the analysis? Suppose we know
that one member of a galaxy pair with angular separation $\theta_{ij}$ has the redshift $z_j$. Then
the probability $P(|Z_{ij}| < \ell \,|\, z_j \theta_{ij})$ that the galaxies have radial separation
$|Z_{ij}| < \ell$ is given by equation A6. The expected total number of pairs in the sample with
redshift separation less than $\ell$ should therefore be equal to the sum of the probabilities for
each unique pair,

$$ n_{\rm exp} = \sum_{i>j}^{\rm pairs} P(|Z_{ij}| < \ell \,|\, z_j \theta_{ij}). \tag{5} $$

Using equation 5 instead of equation 4 further reduces the estimated correlation length (in the
example of § 2) to $r_0 = 5.7 h^{-1}$ Mpc.

Equations 4 and 5 are as sensitive to errors in the selection function as equation 1. This
sensitivity can be eliminated almost completely by using the $K$ statistic of Adelberger et al.
(2004) rather than $n_{\rm obs}$ in the analysis. Letting $n_{\rm obs}(0,\ell)$ stand for the
observed number of pairs with comoving radial separation $0 < |Z_{ij}| < \ell$, $K$ is the ratio

$$ K \equiv \frac{n_{\rm obs}(0,\ell)}{n_{\rm obs}(0,2\ell)}. \tag{6} $$

As long as $n_{\rm obs}(0,2\ell)$ is large enough that

$$ \left\langle \frac{n_{\rm obs}(0,\ell)}{n_{\rm obs}(0,2\ell)} \right\rangle \simeq \frac{\langle n_{\rm obs}(0,\ell) \rangle}{\langle n_{\rm obs}(0,2\ell) \rangle}, \tag{7} $$

$K$ will have expectation value

$$ \langle K \rangle \simeq \frac{n_{\rm exp}(0,\ell)}{n_{\rm exp}(0,2\ell)}. \tag{8} $$

(In this equation, $n_{\rm exp}(0,\ell)$ can be calculated with equation 4, equation 5, or any
number of variants; the value of $K$ will not change significantly.) Adelberger et al. (2004) show
that the right-hand side of equation 8 is almost entirely independent of the assumed
selection-function width $\sigma_{\rm sel}$ when $2\ell$ is small compared to $\sigma_{\rm sel}$. If
we find the value of $r_0$ that makes the right-hand side of equation 8 equal the right-hand side of
equation 6, we will have an estimate of the correlation length whose value does not depend on our
assumptions about the selection function.^3 This is our final approach to estimating $r_0$. Applying
it to the Lyman-break galaxy example of § 2 leads to an estimate $r_0 = 4.0 h^{-1}$ Mpc that agrees
well with the correlation length reported by Adelberger et al. (2004).
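
The observed side of this comparison, equation 6, is a simple ratio of pair counts. The sketch below is illustrative only: it assumes each field is supplied as a list of comoving radial positions in $h^{-1}$ Mpc, counts pairs within each field separately before summing (as in the example of § 2), and uses invented function names and invented positions.

```python
from itertools import combinations

def pair_counts(radial_positions, ell):
    """Count unique pairs in one field with |Z_ij| < ell and with |Z_ij| < 2*ell."""
    n_ell = 0
    n_2ell = 0
    for z_a, z_b in combinations(radial_positions, 2):
        separation = abs(z_a - z_b)
        if separation < 2.0 * ell:
            n_2ell += 1
            if separation < ell:
                n_ell += 1
    return n_ell, n_2ell

def k_statistic(fields, ell):
    """K = n_obs(0, ell) / n_obs(0, 2 ell), summed over fields; None if undefined."""
    total_ell = 0
    total_2ell = 0
    for positions in fields:
        n_ell, n_2ell = pair_counts(positions, ell)
        total_ell += n_ell
        total_2ell += n_2ell
    if total_2ell == 0:
        return None  # no pairs within 2*ell: K cannot be formed
    return total_ell / total_2ell

# Invented radial positions (h^-1 Mpc) for two small fields.
fields = [[0.0, 5.0, 30.0, 100.0], [10.0, 45.0]]
print(k_statistic(fields, ell=20.0))  # 1 pair within ell, 4 within 2*ell
```

For an unclustered sample whose selection function is much wider than $2\ell$, the ratio tends toward 1/2, and clustering pushes it higher. With very few pairs, $K$ can easily equal 1 or be undefined.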

The discrepancy between the correlation lengths estimated with equations 5 and 8 shows that
the observed number of pairs with $\ell < |Z_{ij}| < 2\ell$ is inconsistent with the hypothesis
$r_0 = 5.7 h^{-1}$ Mpc that seemed (according to equation 5) to account for the number of pairs with
$0 < |Z_{ij}| < \ell$. This may indicate that the assumed selection function is incorrect or that
the power-law $\xi(r) = (r/r_0)^{-1.6}$ is a poor approximation to the correlation function for large
separations. The estimate of $r_0$ will be made more robust against either possibility by limiting
the analysis to pairs with smaller separations, say $\theta_{ij} < 300''$. In this case the
estimated correlation lengths ($\pm$ standard deviation of the mean from field-to-field
fluctuations) are $r_0 = 5.1 \pm 1.1$, $4.9 \pm 0.9$, and $4.4 \pm 1.1 h^{-1}$ Mpc for equations 4,
5, and 8, respectively, in good agreement with each other and with the estimate
$r_0 = 4.0 \pm 0.6 h^{-1}$ Mpc from the angular-clustering analysis of Adelberger et al. (2004).

The approaches of this section offer two additional benefits. First, the sum of one-dimensional 
integrals that they require is usually simpler to calculate numerically than the six-dimensional 
integral required by equation 1. Second, as we have seen, the form of the equations makes it easy 
to omit pairs with undesirable angular separations from the analysis. 
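
The pair-by-pair structure of equations 4 and 5 can be sketched as a loop over unique pairs with an optional cut on angular separation. This is a hedged illustration rather than the code used for the analysis: `pair_probability` is a hypothetical callable that the user must supply (for example an implementation of equation A6 or A11), galaxies are assumed to be `(ra_deg, dec_deg, z)` tuples, and the separation is computed with the standard haversine formula.

```python
import math
from itertools import combinations

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    hav = (math.sin((dec2 - dec1) / 2.0) ** 2
           + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(hav)))) * 3600.0

def expected_pairs(galaxies, pair_probability, theta_max_arcsec=None):
    """Sum pair_probability(theta_ij, z_j) over unique pairs.

    pair_probability is a user-supplied (hypothetical) function returning
    P(|Z_ij| < ell | z_j, theta_ij); it may ignore z_j to mimic equation 4.
    Pairs with theta_ij >= theta_max_arcsec are skipped when a cut is given.
    """
    total = 0.0
    for (ra_i, dec_i, _z_i), (ra_j, dec_j, z_j) in combinations(galaxies, 2):
        theta = angular_separation_arcsec(ra_i, dec_i, ra_j, dec_j)
        if theta_max_arcsec is not None and theta >= theta_max_arcsec:
            continue
        total += pair_probability(theta, z_j)
    return total

# Invented catalog and a constant placeholder probability, for illustration only.
catalog = [(0.0, 0.0, 3.0), (0.0, 0.05, 3.1), (0.0, 0.2, 2.9)]
print(expected_pairs(catalog, lambda theta, z: 0.5))
print(expected_pairs(catalog, lambda theta, z: 0.5, theta_max_arcsec=300.0))
```

The same cut must of course be applied to the observed pair count before the two are compared.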



5. NUMERICAL SIMULATIONS 

Unimpressed by the heuristic arguments of the previous section, I tested its recommendations
on simulated galaxy surveys generated from the publicly released GIF-$\Lambda$CDM simulation of
structure formation in a cosmology with $\Omega_m = 0.3$, $\Omega_\Lambda = 0.7$, $h = 0.7$,
$\Gamma = 0.21$, $\sigma_8 = 0.9$. This gravity-only simulation contained $256^3$ particles with mass
$1.4 \times 10^{10} h^{-1} M_\odot$ in a periodic cube of comoving side-length $141.3 h^{-1}$ Mpc,
used a softening length of $20 h^{-1}$ comoving kpc, and was released publicly, along with its halo
catalogs, by Frenk et al. (2000). Further details can be found in Jenkins et al. (1998) and
Kauffmann et al. (1999).

^3 Provided the error in the assumed mean redshift is not large enough to alter significantly the
mapping of redshifts and angles onto distances.

For the test, I made numerous mock pencil-beam surveys from the redshift z = 2.32 catalog of
halos with M > lO^^'^M©, calculated ro for each mock survey with the approaches of equations 1, 
4, 5, and 8, then tabulated and compared the results. To generate a single mock pencil-beam
survey from the cubical simulation, I concatenated numerous randomly selected volumes of size
$13 \times 13 \times 141.3 h^{-3}$ Mpc$^3$ into a long parallelepiped with dimension
$13 \times 13 \times 1700 h^{-3}$ Mpc$^3$. After converting the comoving coordinates of each halo in
the volume into redshift and angle (for $\Omega_m = 0.3$, $\Omega_\Lambda = 0.7$, with
$1700 h^{-1}$ Mpc the redshift depth), I applied various selection effects to produce one mock
pencil beam survey. Numerous additional mock surveys, each generated in the same way, were used in
the analysis. The mock surveys are clearly not exact reproductions of the actual universe. They are
discontinuous every $141.3 h^{-1}$ Mpc, do not include any evolution in structure from the back to
the front of the volume, and have an incorrect power-spectrum on very large
($\gtrsim 141 h^{-1}$ Mpc) scales because they were extracted from a single $141.3 h^{-1}$ Mpc cube.
However, the methods of § 4 work for objects with any spatial distribution, as long as the
correlation function is sharply peaked, and the simulated surveys are similar enough to actual
redshift surveys to provide a meaningful preview of how equations 1, 4, 5, and 8 will behave in
realistic situations.

The results are summarized in Figure 4. All panels are for a simulated survey with a
$10' \times 10'$ field of view. The correlation function slope was fixed to $\gamma = 1.6$ and
$\ell = 20 h^{-1}$ Mpc was taken as the maximum pair separation. The panel on the upper left shows
the distribution of estimated $r_0$ from the four techniques when the pencil beam surveys included
$N_{\rm gal} = 200$ galaxies each and had a Gaussian selection function with mean $\mu_z = 2.2$ and
r.m.s. $\sigma_z = 0.35$. I used the correct selection function in calculating $r_0$ for the
idealized case of this panel, even though normally $r_0$ will be calculated from an assumed
selection function that is at least somewhat incorrect. This panel provides a reference against
which the others can be judged.

The catalogs for the other panels were constructed in the same way, except as noted below.
The middle left panel shows the effect of lowering $N_{\rm gal}$ from 200 to 20. The noise in $r_0$
increases significantly with catalogs so small. The estimates become biased because the dependence
of $r_0$ on the number of pairs $n$ is no longer approximately linear over the plausible range of
$n$. Although no approach performs particularly well, the method of equation 8 is essentially
unusable. This is because random fluctuations in pair counts often make
$n_{\rm obs}(0,\ell) = n_{\rm obs}(0,2\ell)$, and the equivalent relationship for $n_{\rm exp}$
requires $r_0 \to \infty$. (More formally, it is because equation 7 is no longer a good
approximation.)

For the bottom left panel, the simulated galaxies' angular positions were concentrated towards
the center of the field rather than being random: each galaxy's selection probability was multiplied
by a Gaussian with $\sigma = 70''$ centered in the middle of the field, causing 90% of the galaxies
in a typical catalog to fall within a region of diameter $5'$ inside the larger $10' \times 10'$
field. This was intended to mimic the sort of selection effect that can appear in multislit
spectroscopic surveys. In this case equation 1 leads to biased results, since it makes incorrect
assumptions about the galaxies' angular positions, while the three approaches of § 4 are nearly
unaffected.
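
The quoted 90% diameter follows from the enclosed fraction of a circular two-dimensional Gaussian, $1 - \exp(-r^2/2\sigma^2)$. The short check below is a sketch that assumes a circularly symmetric Gaussian and ignores both truncation by the field edges and the underlying uniform density; the function names are invented for the example.

```python
import math

def containment_radius(sigma, fraction):
    """Radius enclosing the given fraction of a circular 2-D Gaussian.

    The enclosed fraction within r is 1 - exp(-r^2 / (2 sigma^2)).
    """
    return sigma * math.sqrt(-2.0 * math.log(1.0 - fraction))

def sigma_for_diameter(diameter, fraction):
    """Gaussian sigma for which `fraction` of sources fall within `diameter`."""
    return diameter / (2.0 * containment_radius(1.0, fraction))

diameter_arcmin = 2.0 * containment_radius(70.0, 0.90) / 60.0
print(f"sigma = 70 arcsec -> 90% diameter of about {diameter_arcmin:.1f} arcmin")
```

Run in the other direction, `sigma_for_diameter(7.2 * 60, 0.90)` gives the width, in arcseconds, that corresponds to the 7.2' diameter quoted for the worst-case panel.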

The upper right panel shows what happens to the inferred value of $r_0$ if the expected pair
counts are calculated under the erroneous assumption that the selection-function width is
$\sigma_z = 0.5$. (In all panels its true value is $\sigma_z = 0.35$.) Equations 1 and 4 fare the
worst, producing estimates of $r_0$ that are too high by a factor of two. Equation 5 leads to
smaller errors, but only because $\sigma_z$ was overestimated; for underestimates it performs worse.
Only equation 8 yields unbiased results.

The middle right panel shows what happens when the assumed selection function has the
correct width $\sigma_z = 0.35$ but the incorrect mean, $\mu_z = 2.8$, instead of the true value
$\mu_z = 2.2$. Equations 1 and 4 produce underestimates of $r_0$, because the selection function is
assumed to be narrower in comoving units than it actually is. Equation 5 produces an overestimate,
doing more harm than good in its mangled attempts to compensate for the alignment of the selection
function with redshift spikes. Equation 8 remains satisfactory.

The bottom right panel shows a worst case scenario, which may be closer than any other
panel to actual cases found in the literature. The sample size is $N_{\rm gal} = 20$, the data are
subject to angular selection effects (modeled by a two dimensional Gaussian distribution that has
90% of sources within a region of diameter $7.2'$), and the pair counts have been analyzed under the
assumption that $\mu_z = 2.2$, $\sigma_z = 0.5$ even though the true selection function has
$\mu_z = 2.2$, $\sigma_z = 0.35$. The results here are so uncertain and biased as to be useless.
Estimates $r_0 > 10 h^{-1}$ Mpc appear alarmingly often, compensated only by the common occurrence
of $r_0 = 0$. Adopting equation 4 or 5 helps reduce the noise, but none of the approaches is likely
to add significantly to the observer's prior knowledge of $r_0$.

6. UNCERTAINTIES 

Equation 2 produces a reasonable estimate of the uncertainty in the simulation results if the
"Poisson" uncertainty $n_{\rm obs}^{1/2}$ is replaced with the true uncertainty
${\rm Var}(n_{\rm obs})^{1/2}$, where ${\rm Var}(n)$ is short-hand for the variance of $n$. As
Figure 5 shows, the two can differ significantly; the clustering of galaxies drives the variance in
pair counts far above the Poisson value ${\rm Var}(n) = n$.

The variance of $n_{\rm obs}$ is easy to estimate for the ensemble of simulated surveys. As long
as random errors dominate over cosmic variance, it can be estimated in real life by splitting a
survey into many smaller subsamples, calculating the dispersion in $n_{\rm obs}$ among the
subsamples, measuring how the dispersion changes with subsample size, and extrapolating to the full
sample size. Sample results are shown in Figure 5. For the Lyman-break survey, this approach leads
to an estimated $1\sigma$ uncertainty in $r_0$ of $\sim 1.3 h^{-1}$ Mpc, which agrees well with the
value $\sigma_{\rm ftf}/N^{1/2} \simeq 1.1 h^{-1}$ Mpc implied by the field-to-field fluctuations
$\sigma_{\rm ftf}$ in the estimated value of $r_0$ from the $N = 15$ individual survey fields.

The preceding discussion applies to values of $r_0$ estimated from equations 1, 4, and 5, since
in these cases $r_0$ is estimated by setting $n_{\rm obs} = n_{\rm exp}$. The uncertainties are
slightly more difficult to estimate in the case of equation 8. One approach, in this case, is to
estimate the dispersion in $r_0$, not $n_{\rm obs}$, among the subsamples, and extrapolate this to
the full sample size.
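
The extrapolation step can be sketched as a straight-line fit in log-log space. Note the assumption: the text does not specify a functional form for how the dispersion changes with subsample size, so the power law used here is only one plausible choice, and the subsample sizes and dispersions in the example are invented.

```python
import math

def fit_power_law(sizes, dispersions):
    """Least-squares fit of log(dispersion) = intercept + slope * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(d) for d in dispersions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def extrapolate_dispersion(sizes, dispersions, full_size):
    """Extrapolate the fitted power law to the full sample size."""
    slope, intercept = fit_power_law(sizes, dispersions)
    return math.exp(intercept + slope * math.log(full_size))

# Invented measurements: dispersion of n_obs (or of r0) among subsamples.
subsample_sizes = [50, 100, 200]
measured_dispersion = [12.0, 30.0, 80.0]
print(extrapolate_dispersion(subsample_sizes, measured_dispersion, 747))
```

The same routine applies whether the quantity tracked among subsamples is $n_{\rm obs}$ (for equations 1, 4, and 5) or $r_0$ itself (for equation 8). Only the interpretation of the result changes.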

These procedures do not work well for small samples, but neither do the methods for estimating
$r_0$ itself. I discuss this further in the summary section.

7. SUMMARY 

This paper analyzed a method that has recently been used to estimate the spatial clustering
strength of small galaxy samples. The method is imperfect. The estimate of $r_0$ (a) depends
sensitively on the assumed selection function (Figure 1), (b) will be biased if the galaxies are not
distributed approximately uniformly across the field (Figure 4), and (c) is strongly affected by the
positions of galaxy overdensities relative to the peak of the selection function (Figure 3).

I suggested three ways to mitigate these problems and tested my suggestions on simulated
galaxy surveys and on the Lyman-break survey. Figure 4 provides a useful overview of the results.
When there are no systematic errors, equation 5 produces the best estimates of $r_0$ and equation 8
the worst. Equation 8 is robust against systematic errors, however, and continues to produce
reasonable estimates in the presence of systematic effects that render the other approaches useless.
Since the approaches respond differently to systematic and random errors, a sensible strategy is to
estimate $r_0$ with all of them^4 and look for consistency among the results.

The sample analysis of the Lyman-break survey helps illustrate the paper's main points. An
initial estimate of $r_0 \sim 11 h^{-1}$ Mpc from equation 1 disagreed badly with the estimate
$r_0 \sim 4 h^{-1}$ Mpc from the robust equation 8, suggesting that the initial analysis must have
had large systematic errors. The largest systematic error came from inaccuracies in the assumed
selection function. Replacing it with a better model reduced the estimated values of $r_0$ to 7.2,
6.4, 5.7, and $4.0 h^{-1}$ Mpc from equations 1, 4, 5, and 8, respectively. The differences were
still not negligible compared to the random uncertainties (§ 6). The high value from equation 1 was
due to artificial angular clustering of galaxies imposed by the survey's spectroscopic selection
criteria. It alone among the estimators does not correct for this. The remaining systematic problems
are not easy to trace. They could result from residual errors in the selection function or from
changes in the correlation function slope at large separations. In any case, since the effect of
systematic errors is minimized when they are small compared to the signal, I maximized the signal by
limiting the analysis to angular pairs with smaller separations. As equation 3 shows, the number of
pairs with large angular separations is more sensitive to low level systematics than to the
clustering strength $\bar\xi$. Restricting the analysis to pairs with angular separation
$\theta_{ij} < 300''$, I obtained the estimates $r_0 = 5.1$, 4.9, and $4.4 h^{-1}$ Mpc from
equations 4, 5, and 8. Since the random uncertainty is $\simeq 1 h^{-1}$ Mpc (§ 6), these estimates
agree well with each other and with the value $r_0 = 4.0 \pm 0.6 h^{-1}$ Mpc favored by the
angular-clustering analysis of Adelberger et al. (2004).

^4 Except equation 1; as far as I can tell, there is no situation where its performance is the best
among the alternatives.

This paper provides some support for the common prejudice against estimates of ro derived 
from small galaxy samples. The middle left panel of Figure 4 shows how large the random uncer- 
tainties are for a simulated sample of A/" = 20 galaxies with true correlation length ro = S.bh"^ 
Mpc in a 10' x 10' pencil-beam survey. Figure 6 may make the point more forcefully. I extracted 
numerous random subsamples of 10 galaxies from the 170-object Lyman-break galaxy catalog in
the Westphal field (Steidel et al. 2003), calculated $r_0$ for each subsample with equation 1 using
the true LBG selection function, and tabulated the results. The spread in estimated $r_0$ is enormous.

In realistic situations, uncertainty in the assumed selection function is likely to be the worst 
source of systematic error. A skeptic might point out that this uncertainty will probably only be 
large in the small sample limit, where none of the approaches work well, and that my suggested 
alternatives are not much of an improvement when the uncertainty in the selection function is 
small (see, e.g., the upper left panel of Figure 4). This is true to a point, but it would be foolish to 
reject the ~ 30% reduction in random uncertainty that equation 5 provides relative to equation 1. 
According to Figure 5, a ~ 30% decrease in the uncertainty in ro for the LBG sample requires a 
~ 40% increase in the number of galaxies. Using equation 5 instead of 1 in the analysis is surely 
easier than requesting, obtaining, and reducing 40% more data. The methods of § 4 are far from 
perfect, but they improve significantly on their predecessor. 

I would like to thank the Florida Airport cafe in La Serena for its hospitality while the first 
draft of this paper was being written. My collaborators in the Lyman-break survey encouraged me
to share the analysis with a wider audience. This work was supported by a fellowship from the 
Carnegie Institute of Washington. 

A. EXPECTED PAIR COUNTS FOR POWER-LAW CORRELATION FUNCTIONS

We derive three simple results needed in the text. In each case, the notation $P(AB|C)$ stands
for the probability that $A$ and $B$ are both true if we know that $C$ is true. According to this
notation, $P(z_1\mathbf{\Theta}_1|z_2\mathbf{\Theta}_2)$ is the probability of finding a galaxy at
redshift $z_1$ and angular position $\mathbf{\Theta}_1$ if we know that there is a galaxy at position
$z_2$, $\mathbf{\Theta}_2$, and $P(z_1\mathbf{\Theta}_1)$ is the probability of finding a galaxy at
the first position if we know nothing about the positions of other galaxies. We assume that the
reduced two-point galaxy correlation function, $\xi$, is an isotropic power-law,
$\xi(r) = (r/r_0)^{-\gamma}$, which implies that

$$ P(z_1\mathbf{\Theta}_1|z_2\mathbf{\Theta}_2) = P(z_1\mathbf{\Theta}_1)\,[1 + (r_{12}/r_0)^{-\gamma}] \tag{A1} $$

where $r_{12}$ is the distance between the points specified by $z_1$, $\mathbf{\Theta}_1$ and $z_2$,
$\mathbf{\Theta}_2$. Since the survey selection function does not depend on sky position,
$P(z_1\mathbf{\Theta}_1) = P(z_1)/\Omega$ where $\Omega$ is the survey's solid angle and $P(z_1)$ is
the expected redshift distribution for a single object in our survey. We adopt the shorthand
$z_{12} \equiv z_1 - z_2$ and $\theta_{12} \equiv |\mathbf{\Theta}_1 - \mathbf{\Theta}_2|$, and use
the capitalized variable $Z_{12}$ to indicate comoving distance between redshifts $z_1$ and $z_2$.



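A.1. Case 1:

If we know that a galaxy with position $z_2$, $\mathbf{\Theta}_2$ has a neighbor at angular
position $\mathbf{\Theta}_1$, what is the probability that the neighbor has redshift $z_1$? In our
notation, we are asking for $P(z_1|z_2\mathbf{\Theta}_1\mathbf{\Theta}_2)$, which can be derived
from the correlation function with elementary probability identities:

$$ P(z_1|z_2\mathbf{\Theta}_1\mathbf{\Theta}_2) = \frac{P(z_1\mathbf{\Theta}_1|z_2\mathbf{\Theta}_2)}{\int_0^\infty dz_1' \, P(z_1'\mathbf{\Theta}_1|z_2\mathbf{\Theta}_2)} \tag{A2} $$

$$ \simeq \frac{P(z_1)\,[1 + \xi(r_{12})]}{1 + a(r_0, \gamma, \theta_{12}, z_2) P(z_2)} \tag{A3} $$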

The second equality assumes that the selection function is independent of angular position and
is roughly constant over the small radial separations where $\xi$ is significantly larger than 0.
It also assumes that $f$ and $g$ (defined in the following sentence) do not change significantly
over the same small radial separations. For clarity we adopt the shorthand

$$ a(r_0, \gamma, \theta, z) \equiv r_0^{\gamma} \, [f(z)\theta]^{1-\gamma} \, g^{-1}(z) \, \beta(\gamma) \tag{A4} $$

where $g(z) \equiv c/H(z)$ is the change in comoving distance with redshift,
$f(z) \equiv (1 + z) D_A(z)$ is the change in comoving distance with angle, $D_A(z)$ is the angular
diameter distance, $\beta(\gamma) \equiv B[1/2, (\gamma - 1)/2]$, and $B$ is the beta function in
the convention of Press et al. (1992).
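
The ingredients of equation A4 are simple to evaluate numerically. The sketch below is illustrative and rests on stated assumptions: a spatially flat cosmology with $\Omega_m = 0.3$ and $\Omega_\Lambda = 0.7$ (so that $(1+z)D_A$ equals the line-of-sight comoving distance), distances in $h^{-1}$ Mpc, angles supplied in arcseconds, and Simpson's rule for the distance integral. The function names are invented for the example.

```python
import math

C_OVER_H0 = 2997.92458  # c / H0 in h^-1 Mpc

def g(z, omega_m=0.3, omega_l=0.7):
    """g(z) = c / H(z): comoving distance per unit redshift (flat cosmology)."""
    return C_OVER_H0 / math.sqrt(omega_m * (1.0 + z) ** 3 + omega_l)

def f(z, omega_m=0.3, omega_l=0.7, steps=2000):
    """f(z) = (1 + z) D_A(z): comoving distance per radian, by Simpson's rule.

    Valid only for a flat cosmology; steps must be even.
    """
    h = z / steps
    total = g(0.0, omega_m, omega_l) + g(z, omega_m, omega_l)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h, omega_m, omega_l)
    return total * h / 3.0

def beta_gamma(gamma):
    """beta(gamma) = B[1/2, (gamma - 1)/2], written with gamma functions."""
    return math.gamma(0.5) * math.gamma((gamma - 1.0) / 2.0) / math.gamma(gamma / 2.0)

def a_coeff(r0, gamma, theta_arcsec, z):
    """The shorthand a(r0, gamma, theta, z) of equation A4 (dimensionless)."""
    theta = math.radians(theta_arcsec / 3600.0)
    return r0 ** gamma * (f(z) * theta) ** (1.0 - gamma) * beta_gamma(gamma) / g(z)

print(a_coeff(r0=4.0, gamma=1.6, theta_arcsec=60.0, z=3.0))
```

The product $a\,P(z)$ measures how much a neighbor at angular separation $\theta$ boosts the chance of finding a galaxy at the same redshift; it grows as $r_0^{\gamma}$ and falls as $\theta^{1-\gamma}$.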

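The probability that the comoving distance $|Z_{12}|$ between $z_1$ and $z_2$ will be less than $\ell$ can be derived by integrating equation A3 over the appropriate range of $z_1$:

$$P(|Z_{12}| < \ell \mid z_2\theta_{12}) = \frac{\int_{z_2 - \ell/g}^{z_2 + \ell/g} dz_1\, P(z_1\Theta_1 \mid z_2\Theta_2)}{\int_0^\infty dz_1'\, P(z_1'\Theta_1 \mid z_2\Theta_2)} \tag{A5}$$

$$\simeq \frac{P(z_2)\left[2\ell g^{-1}(z_2) + a(r_0, \gamma, \theta_{12}, z_2)\, I(\gamma, \ell, \theta_{12}, z_2)\right]}{1 + a(r_0, \gamma, \theta_{12}, z_2)\, P(z_2)} \tag{A6}$$
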
where $I$ is related to the incomplete beta function $I_x$ of Press et al. (1992) through

$$I(\gamma, \ell, \theta, z) = I_x[1/2, (\gamma-1)/2] \tag{A7}$$

with

$$x = \frac{\ell^2}{\ell^2 + f^2(z)\,\theta^2} \tag{A8}$$
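
The quantity $I$ in equations A6 to A8 can be read as the fraction of the line-of-sight integral of $\xi$ that falls within $|Z_{12}| < \ell$. The sketch below evaluates that fraction by direct quadrature instead of calling an incomplete beta routine, then combines it with equation A6. The function names are hypothetical, and `transverse` stands for the product $f(z)\theta$ in the same units as $\ell$.

```python
import math


def beta_fn(a, b):
    """Complete beta function B(a, b) via gamma functions."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def x_argument(ell, transverse):
    """Argument x of the incomplete beta function (equation A8)."""
    return ell ** 2 / (ell ** 2 + transverse ** 2)


def clustered_fraction(gamma, ell, transverse, steps=20000):
    """I(gamma, ell, theta, z) of equation A7, computed by direct quadrature.

    Returns the fraction of the integral of (Z^2 + R^2)^(-gamma/2) over all Z
    that lies within |Z| < ell, where R = transverse.
    """
    step = ell / steps
    partial = 0.0
    for i in range(steps):  # midpoint rule on 0 < Z < ell
        z_mid = (i + 0.5) * step
        partial += (z_mid ** 2 + transverse ** 2) ** (-0.5 * gamma)
    partial *= step
    # Integral over 0 < Z < infinity in closed form.
    full = 0.5 * transverse ** (1.0 - gamma) * beta_fn(0.5, 0.5 * (gamma - 1.0))
    return partial / full


def prob_close_pair(p_z2, g_z2, a, big_i, ell):
    """Equation A6: P(|Z12| < ell | z2, theta12)."""
    return p_z2 * (2.0 * ell / g_z2 + a * big_i) / (1.0 + a * p_z2)
```

For $\gamma = 2$ the fraction has the closed form $(2/\pi)\arctan(\ell/R)$, which gives a convenient check on the quadrature.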






A.2. Case 2: 

What is the probability that a galaxy pair with known angular separation $\theta_{12}$ has comoving redshift separation $|Z_{12}| < \ell$? The probability that a pair with angular separation $\theta_{12}$ will have redshift separation $Z_{12}$ is

$$P(Z_{12} \mid \theta_{12}) = \int_0^\infty dz_2\, P(z_2 \mid \theta_{12})\, P(Z_{12} \mid z_2\theta_{12}), \tag{A9}$$

which implies that the pair will have comoving radial separation $|Z_{12}|$ less than $\ell$ with probability

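$$P(|Z_{12}| < \ell \mid \theta_{12}) = \int_0^\infty dz_2\, P(z_2 \mid \theta_{12})\, \frac{\int_{z_2 - \ell/g}^{z_2 + \ell/g} dz_1\, P(z_1\Theta_1 \mid z_2\Theta_2)}{\int_0^\infty dz_1'\, P(z_1'\Theta_1 \mid z_2\Theta_2)} \tag{A10}$$

$$\simeq \int_0^\infty dz\, \frac{P^2(z)\left[2\ell g^{-1}(z) + a(r_0, \gamma, \theta_{12}, z)\, I(\gamma, \ell, \theta_{12}, z)\right]}{1 + a(r_0, \gamma, \theta_{12}, z)\, P(z)} \tag{A11}$$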



A.3. Case 3: 

What is the expected number of pairs with $|Z_{12}| < \ell$ if we know only that $N$ galaxies lie somewhere in the solid angle $\Omega$? If $I_i$ represents the proposition that galaxy $i$ lies within the surveyed solid angle $\Omega$, the expected number of pairs will depend on $\int_{-\ell}^{\ell} dZ_{12}\, P(Z_{12} \mid I_1 I_2)$, the probability that a randomly selected pair in the survey has comoving redshift separation less than $\ell$. The conditional probability can be rewritten as

$$P(Z_{12} \mid I_1 I_2) = \frac{P(Z_{12} I_1 I_2)}{\int_{-\infty}^{\infty} dZ_{12}\, P(Z_{12} I_1 I_2)} \tag{A12}$$

and the unconditional probability can be expanded to

$$P(Z_{12} I_1 I_2) = \int d\Theta_1\, d\Theta_2\, dz_2\, P(Z_{12}\Theta_1\Theta_2 I_1 I_2 z_2) \tag{A13}$$

where the integrals in equation A13 extend over all space. If $\Theta_1$ is not within $\Omega$, $P(Z_{12}\Theta_1\Theta_2 I_1 I_2 z_2)$ will be equal to 0. If $\Theta_1$ is within $\Omega$, $P(Z_{12}\Theta_1\Theta_2 I_1 I_2 z_2)$ will be equal to $P(Z_{12}\Theta_1\Theta_2 I_2 z_2)$ for the same reason that the probability of being in the Louvre and in France is equal to the probability of being in the Louvre. Since the same arguments apply to $\Theta_2$ and $I_2$, equation A13 can be simplified by omitting $I_1$ and $I_2$ from the right-hand side and restricting the angular integrals to the region $\Omega$. After expanding the integrand with the identity $P(AB) = P(A \mid B)P(B)$, equation A13 becomes

$$P(Z_{12} I_1 I_2) = \int_\Omega d\Theta_1\, d\Theta_2 \int_0^\infty dz_2\, P(Z_{12}\Theta_1 \mid z_2\Theta_2)\, P(z_2\Theta_2). \tag{A14}$$

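The expected number of pairs with $|Z_{12}| < \ell$ is equal to the number of unique pairs multiplied by the probability that a random pair has $|Z_{12}| < \ell$. Substituting equation A14 into equation A12 and integrating over $Z_{12}$, one finds

$$n_{\rm exp} = \frac{N(N-1)}{2}\, \frac{\int_\Omega d\Theta_1\, d\Theta_2 \int_0^\infty dz_2\, P(z_2) \int_{z_2 - \ell/g}^{z_2 + \ell/g} dz_1\, P(z_1)\left[1 + \xi(r_{12})\right]}{\int_\Omega d\Theta_1\, d\Theta_2 \int_0^\infty dz_2\, P(z_2) \int_0^\infty dz_1\, P(z_1)\left[1 + \xi(r_{12})\right]} \tag{A15}$$

$$\simeq \frac{N(N-1)}{2}\, \frac{\int_\Omega d\Theta_1\, d\Theta_2 \int_0^\infty dz\, P^2(z)\left[2\ell g^{-1}(z) + aI\right]}{\int_\Omega d\Theta_1\, d\Theta_2 \int_0^\infty dz\, P(z)\left[1 + aP(z)\right]} \tag{A16}$$
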

which recovers equation 1, aside from the latter equation's imprecise normalization. 
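
A useful sanity check on the Case 3 result is its unclustered limit. When $r_0 \to 0$ the coefficient $a$ vanishes and the expected pair count reduces to $[N(N-1)/2] \int dz\, P^2(z)\, 2\ell g^{-1}(z)$, with no dependence on angular position. The sketch below evaluates that limit for a hypothetical Gaussian selection function and an assumed flat $\Lambda$CDM cosmology ($\Omega_m = 0.3$, $\Omega_\Lambda = 0.7$); all names and numerical values are illustrative and are not taken from the analysis in this paper.

```python
import math


def gaussian_selection(z, mean, sigma):
    """Hypothetical Gaussian selection function P(z), normalized to unit area."""
    return math.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def g_flat_lcdm(z, omega_m=0.3, omega_lambda=0.7):
    """g(z) = c / H(z) in h^-1 Mpc, for an assumed flat LambdaCDM cosmology."""
    return 2997.92458 / math.sqrt(omega_m * (1.0 + z) ** 3 + omega_lambda)


def expected_random_pairs(n_gal, ell, mean, sigma, g=g_flat_lcdm, steps=4000):
    """Unclustered pair count: [N(N-1)/2] * integral of P(z)^2 * 2 ell / g(z)."""
    z_lo = max(0.0, mean - 8.0 * sigma)
    z_hi = mean + 8.0 * sigma
    step = (z_hi - z_lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = z_lo + i * step
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end points
        total += weight * gaussian_selection(z, mean, sigma) ** 2 * 2.0 * ell / g(z)
    return 0.5 * n_gal * (n_gal - 1) * total * step


# Example: 200 galaxies, ell = 20 h^-1 Mpc, P(z) centered on z = 3 with width 0.4.
print(expected_random_pairs(n_gal=200, ell=20.0, mean=3.0, sigma=0.4))
```

Any observed excess over this baseline is what the clustering term $aI$ has to account for, which is why errors in the assumed $P(z)$ propagate directly into the inferred $r_0$.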

Fig. 1. Dependence of the best-fit correlation length in § 2 on the assumed selection function width $\sigma_{\rm sel}$. The point shows the result if $\sigma_{\rm sel}$ is assumed to be 0.4 exactly. In fact $\sigma_{\rm sel}$ will always be somewhat uncertain, and this is one reason that Poisson error bars (shown on the point, and derived from equation 2) underestimate the true uncertainty.

Fig. 2. The probability that two galaxies will have comoving radial separation $Z_{12} < 20\,h^{-1}$ Mpc as a function of the angle $\theta_{12}$ between them. The actual Lyman-break galaxy selection function (see Figure 3) was used in calculating these numbers. The probability of having small redshift separations depends at least as much on the galaxies' angular separations as on their correlation length. This implies that angular separations should be treated carefully when estimating $r_0$ from the number of redshift pairs.

Fig. 3. Dependence of the number of redshift pairs on the locations of galaxy overdensities. The lighter shaded region in the background of both panels shows the Lyman-break galaxy selection function $P(z)$. The darker shaded region shows the galaxy density $\rho(z)$ observed in the field SSA22a, shifted by $\Delta z = -0.15$ in the top panel and by $\Delta z = 0.25$ in the bottom. The units on the y-axis are arbitrary. The galaxy clustering strength is the same in both panels, but the upper panel will have roughly 3.5 times as many pairs with small redshift separations, on average, since the galaxy overdensity is aligned with the peak of the redshift histogram and since the number of pairs is proportional to $\rho^2$. Estimates of $r_0$ derived solely from the number of pairs can be led astray by chance alignments or misalignments of galaxy overdensities with the selection function. Section 4 shows how to remove this unnecessary source of noise; see the discussion near equation 5.

Fig. 4. Performance of the 4 methods on simulated galaxy surveys. Each panel shows the results for surveys generated with a single parameter combination (§ 5). Points labeled A, B, C, and D summarize the results for the methods that use equations 1, 4, 5, and 8, respectively. (Equation 1 was actually used in its correctly normalized form, equation A16.) All fits assumed a correlation function slope $\gamma = 1.6$ and adopted $\ell = 20\,h^{-1}$ Mpc as the maximum pair separation. The circle marks the median estimate of $r_0$; the estimates fell within the shaded region for 68% of the simulated surveys, and within the error bars for 90%. The horizontal dashed line shows the true value of $r_0$, calculated by counting the number of pairs as a function of separation for all halos in the GIF catalog, then fitting a power-law correlation function to the result. The upper left panel is for a survey with $N = 200$ galaxies in a single $10' \times 10'$ field where the true selection function (a Gaussian with mean 2.2 and $\sigma_z = 0.35$) is used in the analysis. Survey parameters are varied in other panels. Middle left: $N = 20$. Bottom left: spectroscopic selection effects concentrate the survey galaxies near the center of the field. Upper right: a selection function with incorrect width ($\sigma_z = 0.5$) is used in the analysis. Middle right: a selection function with incorrect mean is used in …

Fig. 5. Dependence of the variance of pair counts on the number of pairs. Results are shown for Lyman-break galaxies (filled circles) and for halos in the GIF simulation (open squares). To estimate the dependence for LBGs, we created numerous subsamples with different mean numbers of galaxies by eliminating a random fraction of galaxies from the actual LBG catalogs described in § 2. Seven sets of subsamples were created, with the eliminated fraction $f = 0.98$, 0.95, 0.9, 0.8, 0.7, 0.6, and 0.5. The point with $n_{\rm obs} \sim 10^{\ldots}$, ${\rm Var}(n_{\rm obs}) \sim 10^{\ldots}$ shows the mean and variance of the number of galaxy pairs with radial separation $Z_{12} < 20\,h^{-1}$ Mpc in the subsamples with $f = 0.5$. The other points are defined similarly. Each GIF point (open square) shows $\langle n \rangle$ and the variance in $n$ for the ensemble of simulated pencil-beam surveys created for a single set of mock survey parameters (§ 5). These parameter sets include but are not limited to the ones shown in Figure 4. The "Poisson" approximation ${\rm Var}(n) = n$ (solid line) is poor for all values of $n_{\rm obs}$. A better approximation, for the LBG survey, is ${\rm Var}(n) = 1.56\,n^{\ldots}$ (dotted line). Different relationships will hold for different surveys, as the GIF results show, and this relationship should not be assumed in other situations. A sensible way to estimate ${\rm Var}(n)$ for other surveys is to create …

Fig. 6. Distribution of $r_0$ for 10-galaxy LBG subsamples extracted at random from the 170-object Westphal catalog of Steidel et al. (2003). Correlation lengths were estimated with the approach of equation 1.
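
The random subsampling described in the captions of Figs. 5 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the figures: the input is taken to be a list of comoving radial positions in $h^{-1}$ Mpc, each galaxy is eliminated independently with the stated probability, and the function names are hypothetical.

```python
import random
from statistics import mean, pvariance


def count_close_pairs(radial_positions, ell):
    """Number of unique pairs whose radial separation is smaller than ell."""
    ordered = sorted(radial_positions)
    count = 0
    start = 0
    for end, position in enumerate(ordered):
        # Advance `start` until every object in [start, end) lies within ell.
        while position - ordered[start] >= ell:
            start += 1
        count += end - start
    return count


def subsample_pair_statistics(radial_positions, eliminated_fraction, n_trials, ell, seed=0):
    """Mean and variance of the pair count over random subsamples.

    Each galaxy is dropped independently with probability `eliminated_fraction`.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        kept = [p for p in radial_positions if rng.random() >= eliminated_fraction]
        counts.append(count_close_pairs(kept, ell))
    return mean(counts), pvariance(counts)
```

Repeating the call for several eliminated fractions gives one (mean, variance) point per fraction, which is the kind of relation that can be compared against the Poisson expectation ${\rm Var}(n) = n$.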



